5 Applications in Natural Language Processing
5.1 Background
We first review the background of the three topics covered in this section: quantization-aware training for low-bit language models, post-training quantization for low-bit language models, and binary language models.
5.1.1 Quantization-Aware Training (QAT) for Low-Bit Large Language Models
Large pre-trained language models have achieved remarkable success in various natural language processing tasks, largely by scaling up model size and computation overhead [227, 54, 21], which makes it prohibitive to deploy these language models on many resource-constrained devices. To make the deployment of existing language models possible, various model compression techniques have been proposed, such as pruning [64, 172, 244], knowledge distillation [107, 217], weight-sharing [51, 125, 98], dynamic computation with adaptive depth or width [88, 255, 298], and network quantization [285, 221, 195, 6]. Among these techniques, network quantization reduces both the model size and the computation overhead without modifying the network architecture. It has therefore attracted extensive attention, and many methods have been explored to quantize language models.
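
To make this concrete, the following is a minimal sketch of uniform 8-bit weight quantization on a single linear layer, illustrating how quantization shrinks storage (and enables low-bit arithmetic) without touching the network architecture. The layer size and bit-width are arbitrary choices for illustration, not taken from any particular method.

import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024)
w = layer.weight.data                            # full-precision (fp32) weights

scale = w.abs().max() / 127                      # symmetric per-tensor scale
w_int8 = (w / scale).round().clamp(-128, 127).to(torch.int8)
w_dequant = w_int8.float() * scale               # stands in for w at inference

fp32_bytes = w.numel() * 4
int8_bytes = w_int8.numel() * 1
print(f"storage: {fp32_bytes} B -> {int8_bytes} B (4x smaller)")
print(f"max quantization error: {(w - w_dequant).abs().max():.6f}")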
To date, most language model quantization methods follow quantization-aware training (QAT), in which the full-precision model undergoes an entire training process with quantization simulated in the forward pass. In practice, such QAT-based methods usually perform better than other quantization paradigms, such as post-training quantization (PTQ).
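
As a rough illustration of how QAT works, the sketch below fake-quantizes the weights of a linear layer during the forward pass and uses a straight-through estimator (STE) so that gradients still update the full-precision latent weights. The 4-bit width, layer shapes, and toy training step are assumptions made for brevity rather than the setup of any specific method.

import torch
import torch.nn as nn

class FakeQuantize(torch.autograd.Function):
    # Symmetric uniform quantizer; gradients pass through unchanged (STE).
    @staticmethod
    def forward(ctx, w, num_bits=4):
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max() / qmax + 1e-8            # per-tensor scale
        return torch.clamp((w / scale).round(), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                       # straight-through estimator

class QuantLinear(nn.Linear):
    # Linear layer whose weights are fake-quantized in every forward pass.
    def forward(self, x):
        w_q = FakeQuantize.apply(self.weight, 4)       # simulate 4-bit weights
        return nn.functional.linear(x, w_q, self.bias)

# Toy training step: the forward pass sees quantized weights, while the
# full-precision latent weights are updated by back-propagation.
model = nn.Sequential(QuantLinear(64, 64), nn.ReLU(), QuantLinear(64, 8))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 64), torch.randint(0, 8, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()

In a real QAT pipeline, this training loop runs over the entire training set (often combined with distillation losses), which is exactly what makes QAT expensive compared with the post-training approach discussed next.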
5.1.2 Post-Training Quantization (PTQ) for Low-Bit Large Language Models
Although QAT yields satisfactory performance for large language models compared with post-training quantization (PTQ), which relies only on a small calibration set to perform quantization, it often suffers from several issues. Specifically, QAT usually conducts end-to-end back-propagation training over the whole training set, which is slow to train, memory demanding, and data consuming. These issues can sometimes be prohibitive for industrial language models.
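
For contrast, the following is a minimal sketch of a PTQ calibration pass: a trained model is run on a small unlabeled calibration set to record activation ranges, which are then converted into 8-bit scale and zero-point parameters. The model, calibration data, and min-max observer here are illustrative assumptions rather than a particular published PTQ method.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8)).eval()
calibration_set = [torch.randn(32, 64) for _ in range(8)]   # small unlabeled set

# 1. Calibration: record the observed range of each linear layer's input.
ranges, hooks = {}, []
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        def hook(mod, inputs, output, name=name):
            x = inputs[0].detach()
            lo, hi = ranges.get(name, (x.min(), x.max()))
            ranges[name] = (torch.minimum(lo, x.min()), torch.maximum(hi, x.max()))
        hooks.append(module.register_forward_hook(hook))

with torch.no_grad():
    for batch in calibration_set:
        model(batch)
for h in hooks:
    h.remove()

# 2. Convert observed ranges into 8-bit affine quantization parameters.
def qparams(lo, hi, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (hi - lo).clamp(min=1e-8) / (qmax - qmin)
    zero_point = (qmin - lo / scale).round().clamp(qmin, qmax)
    return scale, zero_point

for name, (lo, hi) in ranges.items():
    scale, zp = qparams(lo, hi)
    print(f"{name}: scale={scale.item():.5f}, zero_point={int(zp)}")

Because only these few forward passes over the calibration set are required, PTQ avoids the end-to-end retraining cost that QAT incurs.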
Compared with PTQ methods, QAT mainly has drawbacks in three aspects: training time, memory demand, and data consumption. First, QAT conducts training over the entire training set, so it takes much more time than PTQ, which only processes a small calibration set. Moreover, recent QAT methods [6, 285] further combine two-stage knowledge distillation [107], which can